No unapproved extension of the deadline is allowed. Late submission will lead to 0 credit.
Discussion is encouraged on Ed as part of the Q/A. However, all assignments should be done individually.
Plagiarism is a **serious offense**. You are responsible for completing your own work. You are not allowed to copy and paste, or paraphrase, or submit materials created or published by others, as if you created the materials. All materials submitted must be your own.
All incidents of suspected dishonesty, plagiarism, or violations of the Georgia Tech Honor Code will be subject to the institute’s Academic Integrity procedures. If we observe any (even small) similarities/plagiarisms detected by Gradescope or our TAs, **WE WILL DIRECTLY REPORT ALL CASES TO OSI**, which may, unfortunately, lead to a very harsh outcome. **Consequences can be severe, e.g., academic probation or dismissal, grade penalties, a 0 grade for assignments concerned, and prohibition from withdrawing from the class.**
This assignment consists of both programming and theory questions.
Unless a theory question explicitly states that no work is required to be shown, you must provide an explanation, justification, or calculation for your answer.
To switch between cell for code and for markdown, see the menu -> Cell -> Cell Type
If a question requires a picture, you could use this syntax <img src="" style="width: 300px;"/> to include them within your ipython notebook.
Your write up must be submitted in PDF form. You may use either Latex, markdown, or any word processing software. We will **NOT** accept handwritten work. Make sure that your work is formatted correctly, for example submit $\sum_{i=0} x_i$ instead of \text{sum_{i=0} x_i}
You will submit your code for the autograder in the Assignment 2 Programming sections. Please refer to the Deliverables and Point Distribution section for what parts are considered required, bonus for undergrads, and bonus for all.
We provided you different .py files and we added libraries in those files please DO NOT remove those lines and add your code after those lines. Note that these are the only allowed libraries that you can use for the homework.
You are allowed to make as many submissions until the deadline as you like. Additionally, note that the autograder tests each function separately, therefore it can serve as a useful tool to help you debug your code if you are not sure of what part of your implementation might have an issue.
For the "Assignment 2 - Non-programming" part, you will need to submit to Gradescope a PDF copy of your Jupyter Notebook with the cells ran. See this EdStem Post for multiple ways on to convert your .ipynb into a .pdf file. Please refer to the Deliverables and Point Distribution section for an outline of the non-programming questions.
pairwise_dist [5 pts] - programming
KMeans Implementation [30pts] - programming
Rand Statistic [5 pts] - programming
DBScan [10 pts] - programming BONUS FOR UNDERGRAD
2.1 Performing EM Update [15 pts] - non-programming
2.1.1 [3pts] - non-programming
2.1.2 [3pts] - non-programming
2.1.3 [9pts] - non-programming
2.2 Gradient Descent and EM algorithm [5 pts] - non-programming BONUS FOR ALL
3.1 Helper Functions [15pts] - programming & non-programming
3.1.1. softmax [5pts]
3.1.2. logsumexp [3pts + 2pts] - programming & non-programming
3.1.3. normalPDF [5pts] - for CS4641 students only
3.1.3. multinormalPDF [5pts] - for CS7641 students only
3.2 GMM Implementation [30pts] - programming
3.2.1. init_components [5pts]
3.2.2._ll_joint [10pts]
3.2.3. Setup iterative steps for EM algorithm [15pts]
3.3 Image Compression and Pixel clustering [10pts] - non-programming
4.1: KNN [12pts] - programming
4.1.a. complete, incomplete, unlabeled_ [3pts]
4.1.b. CleanData __call__ [7pts]
4.1.c. MeanCleanData [2pts]
4.2: Getting acquainted with semi-supervised learning approaches [15pts] - programming & non-programming
4.2.a. Write highlight summary [5pts] - non-programming
4.2.b. Implement EM algorithm [10pts] - programming
4.3: Demonstrating the performance of the algorithm [5pts] - programming
accuracy_semi_supervised [2.5pts]
accuracy_GNB [2.5pts]
4.4: Interpretation of Results [2pts] - non-programming
Note: It is highly recommended that you do Q4 (if not for the HW then before the project) as it teaches you imperfect data handling and a good understanding of how the models you have learnt can be used together for better results.
This notebook is tested under python 3.**.**, and the corresponding packages can be downloaded from miniconda. You may also want to get yourself familiar with several packages:
You can create a python conda environment with the necessary packages using the instructions in the environment/environment_setup.md file.
Please implement the functions that have "raise NotImplementedError", and after you finish the coding, please delete or comment "raise NotImplementedError".
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from __future__ import absolute_import
from __future__ import print_function
from __future__ import division
%matplotlib inline
import sys
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
import utilities.localtests as localtests
from mpl_toolkits.mplot3d import axes3d
from tqdm import tqdm
print('Version information')
print('python: {}'.format(sys.version))
print('matplotlib: {}'.format(matplotlib.__version__))
print('numpy: {}'.format(np.__version__))
# Load image
import imageio
%load_ext autoreload
%autoreload 2
Version information python: 3.10.13 | packaged by Anaconda, Inc. | (main, Sep 11 2023, 13:24:38) [MSC v.1916 64 bit (AMD64)] matplotlib: 3.7.2 numpy: 1.26.0
Rob Boss, a perpetual Ph.D student at Georgia Tech University, is teaching his dissertation robot, Pablo, how to paint in preparation for the Clough Art Crawl. Unfortunately, Rob is on a shoestring PhD budget and can only afford to buy Pablo a limited variety of colors to practice with. Luckily Rob has taken Machine Learning and remembers that KMeans clustering can be used to compress images down to a few colors. Here you will help Rob paint this happy lovely dog by implementing KMeans.
(Artwork generated by Fotor)

KMeans is trying to solve the following optimization problem:
\begin{align} \arg \min*S \sum*{i=1}^K \sum\_{x_j \in S_i} ||x_j - \mu_i||^2 \end{align}where one needs to partition the N observations into K clusters: $S = \{S_1, S_2, \ldots, S_K\}$ and each cluster has $\mu_i$ as its center.
In this section, you are asked to implement pairwise_dist function.
Given $X \in \mathbb{R}^{N \times D}$ and $Y \in \mathbb{R}^{M \times D}$, obtain the pairwise distance matrix $dist \in \mathbb{R}^{N \times M}$ using the euclidean distance metric, where $dist_{i, j} = ||X_i - Y_j||_2$.
DO NOT USE LOOPS in your implementation, using for-loops or while-loops will result in 0 credit for this portion. Use array broadcasting instead.
Hint: Avoid creating a third dimension (3D arrays) while calculating pairwise distance, otherwise it will lead to time out.
Hint: To calculate X^2 and Y^2 you can refer to the sum of squares function from assignment 1.
We have provided some unit tests in localtests.py for you to check your implementation. See Using the Local Tests for more details.
localtests.KMeansTests().test_pairwise_dist()
localtests.KMeansTests().test_pairwise_speed()
UnitTest passed successfully! UnitTest passed successfully!
In this section, you are asked to implement several methods in kmeans.py
The Kmeans algorithm is sensitive to how the centers are initialized. The naive approach is to randomly initialize the centers. However, a bad initialization can increase the time required for convergence or may even converge to a non-optimal solution.
Hint: Please initialize centers by randomly sampling points from the dataset in case the autograder fails.
Hint: We need to initialize the centers without repetition.
The algorithm for KMPP that you will implement can be described as follows:
After you've chosen your centers, you will need to update the membership of each point based on the closest center. You will implement this in update_assignment. See docstring for more details.
Since cluster memberships may have changed, you will need to update the cluster centers. You will implement this in update_centers. See docstring for more details.
We will consider KMeans to be converged when the change in loss drops below a threshold value. The loss will be defined as the sum of the squared distances between each point and its respective center.
Hint: You may use a loop over K to keep track of the data in each cluster. Avoid looping over N individual datapoints.
In the train method you will use all of the previously implemented steps to train your KMeans algorithm until convergence. Since the centers have already been initialized in the init function the general steps for the train method is as follows:
We have provided the following local tests to help you check your implementation. Provided unit-tests are meant as a guide and are not intended to be comprehensive. See Using the Local Tests for more details.
localtests.KMeansTests().test_init()
localtests.KMeansTests().test_update_centers()
localtests.KMeansTests().test_kmeans_loss()
[1.03616098 2.99491511]
UnitTest passed successfully! UnitTest passed successfully!
We will now take a look at how image quality is impacted by the number of clusters
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Note that because of a different file structure, students' paths will be different
from utilities.utilities import *
image_values = image_to_matrix("./data/images/dog.png")
r = image_values.shape[0]
c = image_values.shape[1]
ch = image_values.shape[2]
# flatten the image_values
image_values = image_values.reshape(r * c, ch)
print("Loading...")
image_2 = update_image_values(2, image_values, r, c, ch).reshape(r, c, ch)
image_5 = update_image_values(5, image_values, r, c, ch).reshape(r, c, ch)
image_10 = update_image_values(10, image_values, r, c, ch).reshape(r, c, ch)
image_20 = update_image_values(20, image_values, r, c, ch).reshape(r, c, ch)
plot_image(
[image_2, image_5, image_10, image_20], ["K = 2", "K = 5", "K = 10", "K = 20"]
)
Loading...
In this section, you will create a function to assess the quality of a clustering algorithm using the Rand Statistic. The Rand Statistic quantifies the goodness or quality of a clustering algorithm when compared to a ground truth.
As discussed in class, the Rand Index is calculated as follows: $$Rand = \frac{TP + TN}{TP + FP + FN + TN}$$
TP (True Positive) represents the number of pairs of data points that are correctly clustered together in both the algorithm's result and the ground truth.
TN (True Negative) is the count of pairs of data points that are correctly placed in separate clusters in both the algorithm's result and the ground truth.
FP (False Positive) counts the pairs of data points that are incorrectly clustered together in the algorithm's result but correctly separated in the ground truth.
FN (False Negative) is the number of pairs of data points that are incorrectly separated in the algorithm's result but correctly clustered together in the ground truth.
We have provided the following local tests to help you check your implementation. Provided unit-tests are meant as a guide and are not intended to be comprehensive. See Using the Local Tests for more details.
Refer to the class notes for more information on the Rand Statistic
localtests.KMeansTests().test_rand_statistic()
Expected value: 0.5909090909090909 Your value: 0.5909090909090909 UnitTest passed successfully!
You've now done the best you can selecting the perfect starting points and the right number of clusters. However, one of the limitations of K-Means Clustering is that it depends largely on the shape of the dataset. A common example of this is trying to cluster one circle within another (concentric circles). A K-means classifier will fail to do this and will end up effectively drawing a line that crosses the circles. You can visualize this limitation in the cell below.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# visualize limitation of kmeans
from kmeans import *
from sklearn.datasets import make_circles, make_moons
X1, y1 = make_circles(factor=0.5, noise=0.05, n_samples=1500)
X2, y2 = make_moons(noise=0.05, n_samples=1500)
def visualise(
X, C, K=None
): # Visualization of clustering. You don't need to change this function
fig, ax = plt.subplots()
ax.scatter(X[:, 0], X[:, 1], c=C, cmap="rainbow")
if K:
plt.title("Visualization of K = " + str(K), fontsize=15)
plt.show()
pass
kmeans = KMeans(X1, 2)
centers1, cluster_idx1, loss1 = kmeans.train()
visualise(X1, cluster_idx1, 2)
kmeans = KMeans(X2, 2)
centers2, cluster_idx2, loss2 = kmeans.train()
visualise(X2, cluster_idx2, 2)
Let us try to solve these limitations using another clustering algorithm: DBSCAN. As mentioned in lecture, DBSCAN tries to find dense regions in the data space, separated by regions of lower density. DBSCAN is parameterized by two parameters (eps and minPts):
Refer to the class slides for the DBSCAN pseudocode to complete fit(), expandCluster(), and regionQuery() in dbscan.py.
HINTS:
The following unittests will help get you started, but is in no way comprehensive. You are encouraged to extend and create your own test cases. See Using the Local Tests for more details.
localtests.DBScanTests().test_region_query()
localtests.DBScanTests().test_expand_cluster()
UnitTest passed successfully!
neightbor length 4
my cluster_idx [ 0. -1. 0. -1. -1. 0. -1. 0. 0. -1.]
my visited {0, 2, 5, 7, 8}
UnitTest passed successfully!
Then, test your fitting by running the cell below. You should be able to get a perfect clustering for the two circles dataset, which you can observe quantitatively by checking whether the clusters returned by cluster_idx and the ground truth clusters are the same and qualitatively by visualizing the clusters.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
BEST_EPS = 0.11
BEST_POINTS = 3
from dbscan import DBSCAN
dbscan = DBSCAN(BEST_EPS, BEST_POINTS, X1)
cluster_idx = dbscan.fit()
## Note that one of the two cells should print True for a correct implementation
print(np.array_equal(y1, cluster_idx)) # Checks if y1 == cluster_idx
print(
np.array_equal(y1, 1 - cluster_idx)
) ## Checks if y1 is the exact opposite of cluster_idx (1s instead of 0s and vice-versa)
visualise(X1, cluster_idx)
neightbor length 27 neightbor length 42 True False
SOLUTIONS CANNOT BE HANDWRITTEN
A univariate Gaussian Mixture Model (GMM) has two components, both of which have their own mean and standard deviation. The model is defined by the following parameters: $$ \mathbf{z} \sim Bernoulli(\alpha) = \begin{cases} \alpha &\text{if} \, z=0 \\ 1-\alpha &\text{if} \, z=1 \\\end{cases}$$ $$ \mathbf{x_n \mid z=0} \sim \mathcal{N}(6\upsilon, 5\omega^{2}) $$ $$ \mathbf{x_n \mid z=1} \sim \mathcal{N}(3\upsilon, 7\omega^{2}) $$
For a dataset of N datapoints, find the following:
2.1.1. Write the marginal probability of x, i.e. $p(x)$ [3pts]
-- Express your answers in terms of $\mathcal{N}(6\upsilon, 5\omega^{2})$ and $\mathcal{N}(3\mu, 7\omega^{2})$
-- HINT: Start with the Sum Rule
2.1.2. E-Step: Compute the posterior probabilities, i.e, $p(z_0|x), p(z_1|x)$ [3pts]
-- Express your answers in terms of $\mathcal{N}(6\upsilon, 5\omega^{2})$ and $\mathcal{N}(3\upsilon, 7\omega^{2})$
-- HINT: Try to apply Beyes Rule
2.1.3. M-Step: Compute the updated value of $\omega^{2}$. (You can keep $\mu$ fixed when you calculate the derivative.) [9pts]
-- Note that $\omega^2$ is a shared variable between the two distributions, your final answer should be one equation including both Gaussian distributions
-- Express your answers in terms of $\tau$, $x$, and $\upsilon$ (you will need to expand $\mathcal{N}(6\upsilon, 5\omega^{2})$ and $\mathcal{N}(3\upsilon, 7\omega^{2})$ into its PDF form)
-- HINT: Start from the below equation, note that $\theta$ is shorthand for various variables, and take the derivative w.r.t. $\omega^2$
Recall that $p(x_n \mid z_k, \upsilon, \omega^2) \rightarrow \mathcal{N}(x_n \mid \mu_k, \sigma_k)$ has been defined at the beginning of the problem.
You can refer to this lecture to gain an understanding of the EM Algorithm. For your convenience, I have included the link below:
https://mahdi-roozbahani.github.io/CS46417641-spring2023/course/09-gaussian-mixture.pdf
2.2.1: $$p(x) = \pi_0 f_0(x) + \pi_1 f_1(x)$$ $$p(x) = \sum_k N(x|\mu_k,\sigma_k)\pi_k$$ $$p(x) = N(x|z_k)p(z_k)$$ $$p(x) = \alpha N(6v, 5\omega^2) + (1-\alpha) N(3v,7\omega^2)$$
2.2.2: $$\tau_0 = \frac{p(x,z_0)}{\sum_{k=0}^{k=1}p(x,z_k)}$$ $$\tau_0 = \frac{p(x,z_0)}{p(x)}$$ $$\tau_0 = p(z_0|x)$$ $$\tau_0 = \frac{p(x|z_0)p(z_0)}{p(x)}$$ $$\tau_0 = \frac{\alpha N(6v, 5\omega^2)}{\alpha N(6v, 5\omega^2) + (1-\alpha) N(3v,7\omega^2)}$$ $$\tau_1 = \frac{(1-\alpha) N(3v,7\omega^2)}{\alpha N(6v, 5\omega^2) + (1-\alpha) N(3v,7\omega^2)}$$
2.2.3 $$\frac{\partial l}{\partial \omega} = \sum^{N}\sum^{k} \tau_k^{'}\ln[p(z_k)N(6v,5\omega^2)+p(z_k)N(3v,7\omega^{2})] + \frac{\tau_k N(6v,5\omega^{2})^{'}+N(3v,7\omega^2)^{'}}{p(z_k)N(6v,5\omega^{2})+p(z_k)N(3v,7\omega^{2})}$$ I was unable to fully solve this out, but for the sake of partial credit I know that I must solve out all of the prime terms in terms of omega (by expanding the N terms using the pdf equation), and it should bring me to the solution.
2.2. Compared to the Gradient Ascent algorithm, what's the computational advantage of using EM algorithm for the problem presented in 2.1?Please provide your own analysis and it can be qualitatively. [5pts]
-- HINT: Think about the difference in updating parameters during each iteration
Please make sure to read the problem setup in detail. Many questions for this section may have already been answered in the description and hints and docstrings.
A Gaussian Mixture Model (GMM) is a probabilistic model that assumes all the data points are generated from a mixture of a finite number of Gaussian Distribution. In a nutshell, GMM is a soft clustering algorithm in a sense that each data point is assigned to a cluster with a probability. In order to do that, we need to convert our clustering problem into an inference problem.
Given $N$ samples $X = [x_1, x_2, \ldots, x_N]^T$, where $x_i \in \mathbb{R}^D$. Let $\pi$ be a K-dimensional probability density function and $(\mu_k; \Sigma_k)$ be the mean and covariance matrix of the $k^{th}$ Gaussian distribution in $\mathbb{R}^d$.
The GMM object implements EM algorithms for fitting the model and MLE for optimizing its parameters. It also has some particular hypothesis on how the data was generated:
Our goal is to find a $K$-dimension Gaussian distributions to model our data $X$. This can be done by learning the parameters $\pi, \mu$ and $\Sigma$ through likelihood function. Detailed derivation can be found in our slide of GMM. The log-likelihood function now becomes:
\begin{align} \text{ln } p(x*1, \dots, x_N | \pi, \mu, \Sigma) = \sum*{i=1}^N \text{ln } \big( \sum\_{k=1}^{K} \pi(k) \mathcal{N}(x_i | \mu_k, \Sigma_k)\big) \end{align}From the lecture we know that MLEs for GMM all depend on each other and the responsibility $\tau$. Thus, we need to use an iterative algorithm (the EM algorithm) to find the estimate of parameters that maximize our likelihood function. All detailed derivations can be found in the lecture slide of GMM.
In this step, we need to calculate the responsibility $\tau$, which is the conditional probability that a data point belongs to a specific cluster $k$ if we are given the datapoint, i.e. $P(z_k|x)$. The formula for $\tau$ is given below:
$$ \tau\left(z_k\right)=\frac{\pi_{k} \cal{N}\left(x | \mu_{k}, \Sigma_{k}\right)}{\sum_{j=1}^{K} \pi_{j} \cal{N}\left(x | \mu_{j}, \Sigma_{j}\right)}, \quad \text{for } k = 1, \dots, K $$Note that each data point should have one probability for each component/cluster. For this homework, you will work with $\tau\left(z_k\right)$ which has a size of $N\times K$ and you should have all the responsibility values in one matrix. We use gamma as $\tau$ in this homework.
After we obtained the responsibility, we can find the update of parameters, which are given below:
\begin{align} \mu*k^{new} &= \dfrac{\sum*{n=1}^N \tau(z*k)x_n}{N_k} \\ \Sigma_k^{new} &= \dfrac{1}{N_k}\sum*{n=1}^N \tau (z*k)^T(x_n - \mu_k^{new})^T(x_n-\mu_k^{new}) \\ \pi_k^{new} &= \dfrac{N_k}{N} \end{align}where $N_k = \sum*{n=1}^N \tau(z_k)$. Note that the updated value for $\mu_k$ is used when updating $\Sigma_k$. The multiplication of $\tau (z_k)^T(x_n - \mu_k^{new})^T$ is element-wise so it will preserve the dimensions of $(x_n - \mu_k^{new})^T$.
Special Notes
Hints
DO NOT USE FOR LOOPS OVER N. You can always find a way to avoid looping over the observation datapoints in our homework problem. If you have to loop over D or K, that is fine.
You can initiate $\pi(k)$ the same for each $k$, i.e. $\pi(k) = \frac{1}{K}, \forall k = 1, 2, \ldots, K$.
In part 3 you are asked to generate the model for pixel clustering of image. We will need to use a multivariate Gaussian because each image will hava $N$ pixels and $D=3$ features which corresponds to red, green, and blue color intensities. It means that each image is a $(N\times3)$ dataset matrix. In the following parts, remember $D=3$ in this problem.
To avoid using for loops in your code, we recommend you take a look at the concept Array Broadcasting in Numpy. Also, certain calculations that required different shapes of arrays can also be achieved by broadcasting.
Be careful of the dimensions of your parameters. Before you test anything on the autograder, please look at the instructions below on the shapes of the variables you need to output and how to format your return statement. Print the shape of an array by print(array.shape) could enhance the functionality of your code and help you debugging. Also notice that a numpy array in shape $(N,1)$ is NOT the same as that in shape $(N,)$ so be careful and consistent on what you are using. You can see the detailed explanation here. Difference between numpy.array shape (R, 1) and (R,)
To facilitate some of the operations in the GMM implementation, we would like you to implement the following three helper functions. In these functions, "logit" refers to an input array of size $(N, D)$ that reperesents the unnormalized scores, that are passed to the softmax( ) or logsumexp( ) function. Remember the goal of helper functions is to facilitate our calculation so DO NOT USE FOR LOOP OVER N.
Given $logit \in \mathbb{R}^{N \times D}$, calculate $prob \in \mathbb{R}^{N \times D}$, where $prob_{i, j} = \frac{\exp(logit_{i, j})}{\sum_{d=1}^D exp(logit_{i, d})}$.
Notes:
Special Notes
Given $logit \in \mathbb{R}^{N \times D}$, calculate $s \in \mathbb{R}^N$, where $s_i = \log \big( \sum_{j=1}^D \exp(logit_{i, j}) \big)$. Again, pay attention to the numerical problem. You may face similar condition as in the softmax function. In this case, add the maximum for each row of $logit$ back for your functions
Special Notes
Answer:
$$\text{using this example given, we get a logsum expression of:}$$$$\log(e^{log_{11}-log{13}}+e^{log_{11}-log{13}}+1)$$$$\text{what subtracting this largest value does is scale the values, and ultimately the result to the left by the max value in each row.}$$$$\text{This is true because we are taking the values as an exponent and then taking the log in the end.}$$$$\text{Meaning we are shifting the entire expression by some constant, but since we know we shifted the values left by the max, we know that the constant is the max.}$$$$\text{In order to make sure we get the correct values, we must shift the values back to the right by the max.}$$$$\text{This keeps the ratio of distance from each value to each other, but places them back where they should be in terms of exact values.}$$$$\text{Essentially we subtract the max to get easier values to compute, calculate the logsum and shift back while preserving the ratio of the logs.}$$You should be able to write your own function based on the following formula, and you are NOT allowed to use outside resource packages other than those we provided.
(for undergrads only) normalPDF
Using the covariance matrix as a diagonal matrix with variances of the individual variables appearing on the main diagonal of the matrix and zeros everywhere else means that we assume the features are independent. In this case, the multivariate normal density function simplifies to the expression below: $$\mathcal{N}(x: \mu, \Sigma) = \prod_{i=1}^D \frac{1}{\sqrt{2\pi\sigma_i^2}}\exp{\left( -\frac{1}{2\sigma_i^2} (x_i-\mu_i)^2\right)}$$ where $\sigma^2_i$ is the variance for the $i^{th}$ feature, which is the diagonal element of the covariance matrix.
(for grads only) multinormalPDF
Given the dataset $X \in \mathbb{R}^{N \times D}$, the mean vector $\mu \in \mathbb{R}^{D}$ and covariance matrix $\Sigma \in \mathbb{R}^{D \times D}$ for a multivariate Gaussian distrubution, calculate the probability $p \in \mathbb{R}^{N}$ of each data. The PDF is given by $$\mathcal{N}(X: \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}}|\Sigma|^{-1/2}\exp{\left(-\frac{1}{2}(X-\mu)\Sigma^{-1}(X-\mu)^T\right)}$$ where $|\Sigma|$ is the determinant of the covariance matrix.
Hints:
If you encounter "LinAlgError", you can mitigate your number/array by summing a small value before taking the operation, e.g. np.linalg.inv($\Sigma_k$ + SIGMA_CONST). You can arrest and handle such error by using Try and Exception Block in Python.Please only add SIGMA_CONST to all elements when sigma_i is not invertible.
In the above calculation, you must avoid computing a $(N,N)$ matrix. Using the above equation for large N will crash your kernel and/or give you a memory error on Gradescope. Instead, you can do this same operation by calculating $(X-\mu)\Sigma^{-1}$, a $(N,D)$ matrix, transpose it to be a $(D,N)$ matrix and do an element-wise multiplication with $(X-\mu)^T$, which is also a $(D,N)$ matrix. Lastly, you will need to sum over the 0 axis to get a $(1,N)$ matrix before proceeding with the rest of the calculation. This uses the fact that doing an element-wise multiplication and summing over the 0 axis is the same as taking the diagonal of the $(N,N)$ matrix from the matrix multiplication.
In Numpy implementation for each individual $\mu$, you can either use a 2-D array with dimension $(1,D)$ for each Gaussian Distribution, or a 1-D array with length $D$. Same to other array parameters. Both ways should be acceptable but pay attention to the shape mismatch problem and be consistent all the time when you implement such arrays.
Please DO NOT use self.D in your implementation of multinormalPDF() .
Things to do in this problem:
Examples of how you can initialize the parameters.
create_pi(): Set the prior probability $\pi$ the same for each class.create_mu(): Initialize $\mu$ by randomly selecting K numbers of observations as the initial mean vectors. You can use int(np.random.uniform()) to get the row index number of the datapoints randomly.create_sigma(): Initialize the covariance matrix with np.eye() for each k. For grads, you can also initialize the $\Sigma$ by K diagonal matrices. It will become a full matrix after one iteration, as long as you adopt the correct computation._init_components() methodThe log-likelihood function is given by:
$$ \begin{align} \ell(\theta) = \sum_{i=1}^N \text{ln } \big( \sum_{k=1}^{K} \pi(k) \mathcal{N}(x_i | \mu_k, \Sigma_k)\big) \end{align} $$In this part, we will generate a $(N,K)$ matrix where each datapoint $x_i, \forall i = 1, \dots, N$ has $K$ log-likelihood numbers. Thus, for each $i = 1, \dots, N$ and $k = 1, \dots, K$,
$$ \text{log-likelihood}[i,k] = \log{\pi_k}+\log{\cal{N}(x_i|\mu_k, \Sigma_k)} $$Hints:
You can find the detail instruction in the above description box.
Hints:
Use these to test if your implementation of functions in GMM work as expected. See Using the Local Tests for more details.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from gmm import GMM
gmm_tester = localtests.GMMTests()
#gmm_tester.test_helper_functions()
gmm_tester.test_init_components()
Your _init_component's pi works within expected range: True Your _init_component's mu works within expected range: False Your _init_component's sigma works within the expected range: True
gmm_tester.test_undergrad()
Your normal pdf works within the expected range: False Your lljoint works within the expected range: True Your E step works within the expected range: True Your M step works within the expected range: True
gmm_tester.test_grad()
--------------------------------------------------------------------------- NotImplementedError Traceback (most recent call last) c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\HW2_Fall2023_Student.ipynb Cell 44 line 1 ----> <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y302sZmlsZQ%3D%3D?line=0'>1</a> gmm_tester.test_grad() File c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\utilities\localtests.py:288, in GMMTests.test_grad(self) 276 # # test mutlinormalPDF 277 sigma_grad = np.array([[[0.30792796, 0.07909229, -0.11016917], 278 [0.07909229, 0.86422655, 0.06975468], 279 [-0.11016917, 0.06975468, 0.63212106]], (...) 286 [0.07909229, 0.86422655, 0.06975468], 287 [-0.11016917, 0.06975468, 0.63212106]]]) --> 288 my_multinormalpdf = GMM(data, 3).multinormalPDF(points, mu[0], sigma_grad[0]) 289 expected_multinormal_pdf = np.array([1.24124610e-01, 1.09187066e-02, 2.83199962e-02, 4.84688341e-02, 290 3.96140713e-02, 3.55094160e-04, 1.15479475e-02, 5.06551884e-02, 291 3.37348839e-04, 6.34219531e-03, 1.20587696e-04, 8.14168432e-03, 292 5.76457373e-04, 1.82882581e-03, 1.49926366e-02]) 294 print("Your multinormal pdf works within the expected range: ", np.allclose(expected_multinormal_pdf, my_multinormalpdf)) File c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\gmm.py:108, in GMM.multinormalPDF(self, points, mu_i, sigma_i) 93 def multinormalPDF(self, points, mu_i, sigma_i): # [5pts] 94 """ 95 Args: 96 points: N x D numpy array (...) 105 try using another method involving the current arguments to get the value of D 106 """ --> 108 raise NotImplementedError NotImplementedError:
Images typically need a lot of bandwidth to be transmitted over the network. In order to optimize this process, most image processors perform lossy compression of images (lossy implies some information is lost in the process of compression).
In this section, you will use your GMM algorithm implementation to do pixel clustering and compress the images. That is to say, you would develop a lossy image compression algorithm. (Hint: you can adjust the number of clusters formed and justify your answer based on visual inspection of the resulting images or on a different metric of your choosing)
Special Notes
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# helper function for performing pixel clustering.
def cluster_pixels_gmm(image, K, full_matrix=False):
"""Clusters pixels in the input image
Args:
image: input image of shape(H, W, 3)
K: number of components
Return:
clustered_img: image of shape(H, W, 3) after pixel clustering
"""
im_height, im_width, im_channel = image.shape
flat_img = np.reshape(image, [-1, im_channel]).astype(np.float32)
gamma, (pi, mu, sigma) = GMM(flat_img, K=K, max_iters=10)(full_matrix)
cluster_ids = np.argmax(gamma, axis=1)
centers = mu
gmm_img = np.reshape(centers[cluster_ids], (im_height, im_width, im_channel))
return gmm_img
# helper function for plotting images. You don't have to modify it
def plot_images(img_list, title_list, figsize=(20, 10)):
assert len(img_list) == len(title_list)
fig, axes = plt.subplots(1, len(title_list), figsize=figsize)
for i, ax in enumerate(axes):
ax.imshow(img_list[i] / 255.0)
ax.set_title(title_list[i])
ax.axis("off")
# the direction of two images. Both of them are from ImageNet
img1_dir = "./data/images/gmm-example5.png"
img2_dir = "./data/images/gmm-example4.png"
# example of loading image
image1 = imageio.imread("./data/images/gmm-example1.png")
# this is for you to implement
def perform_compression(image, min_clusters=5, max_clusters=15):
"""
Using the helper function above to find the optimal number of clusters that can appropriately produce a single image.
You can simply examine the answer based on your visual inspection (i.e. looking at the resulting images) or provide any metrics you prefer.
Args:
image: input image of shape(H, W, 3)
min_clusters, max_clusters: the minimum and maximum number of clusters you should test with. Default are 5a dn 15.
(Usually the maximum number of clusters would not exeed 15)
Return:
plot: comparison between original image and image pixel clustering.
optional: any other information/metric/plot you think is necessary.
UNDERGRADS HINT: set full_matrix = False when using the helper functions
"""
# TODO: Finish this function
compressed_5 = cluster_pixels_gmm(image, K=5, full_matrix=False)
compressed_10 = cluster_pixels_gmm(image, K=10, full_matrix=False)
compressed_15 = cluster_pixels_gmm(image, K=15, full_matrix=False)
plot1 = plot_images([image, compressed_5], ["original", "k=5"])
plot2 = plot_images([image, compressed_15], ["original","k=15"])
return plot1,plot2
image1 = imageio.imread(img1_dir)
perform_compression(image1, 5, 10)
image2 = imageio.imread(img2_dir)
perform_compression(image2, 5, 10)
C:\Users\shres\AppData\Local\Temp\ipykernel_17340\1056173917.py:6: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
image1 = imageio.imread("./data/images/gmm-example1.png")
C:\Users\shres\AppData\Local\Temp\ipykernel_17340\1056173917.py:36: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
image1 = imageio.imread(img1_dir)
iter 9, loss: -546051.1334: 100%|██████████| 10/10 [00:01<00:00, 9.10it/s]
iter 9, loss: -563977.2629: 100%|██████████| 10/10 [00:02<00:00, 4.52it/s]
iter 9, loss: -597013.1157: 100%|██████████| 10/10 [00:03<00:00, 3.14it/s]
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
C:\Users\shres\AppData\Local\Temp\ipykernel_17340\1056173917.py:39: DeprecationWarning: Starting with ImageIO v3 the behavior of this function will switch to that of iio.v3.imread. To keep the current behavior (and make this warning disappear) use `import imageio.v2 as imageio` or call `imageio.v2.imread` directly.
image2 = imageio.imread(img2_dir)
iter 9, loss: -499913.5638: 100%|██████████| 10/10 [00:01<00:00, 8.68it/s]
iter 9, loss: -564336.9176: 100%|██████████| 10/10 [00:02<00:00, 4.29it/s]
iter 9, loss: -683974.4501: 100%|██████████| 10/10 [00:03<00:00, 3.02it/s]
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
Clipping input data to the valid range for imshow with RGB data ([0..1] for floats or [0..255] for integers).
(None, None)
Compare full covariance matrix with diagonal covariance matrix. Can you explain why the images are different with same clusters? Note: You will have to implement both multinormalPDF and normalPDF, and add a few arguments in the original _ll_joint(), _M_step(), _E_step() function. You will earn full credit only if you implement all functions AND provide an explanation.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
def compare_matrix(image, K):
"""
Args:
image: input image of shape(H, W, 3)
K: number of components
Return:
plot: comparison between full covariance matrix and diagonal covariance matrix.
"""
# full covariance matrix
gmm_image_full = cluster_pixels_gmm(image, K, full_matrix=True)
# diagonal covariance matrix
gmm_image_diag = cluster_pixels_gmm(image, K, full_matrix=False)
plot_images(
[gmm_image_full, gmm_image_diag],
["full covariance matrix", "diagonal covariance matrix"],
)
compare_matrix(image1, 5)
0%| | 0/10 [00:00<?, ?it/s]
--------------------------------------------------------------------------- TypeError Traceback (most recent call last) c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\HW2_Fall2023_Student.ipynb Cell 50 line 1 ----> <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=0'>1</a> compare_matrix(image1, 5) c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\HW2_Fall2023_Student.ipynb Cell 50 line 1 <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=6'>7</a> """ <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=7'>8</a> Args: <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=8'>9</a> image: input image of shape(H, W, 3) (...) <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=12'>13</a> plot: comparison between full covariance matrix and diagonal covariance matrix. <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=13'>14</a> """ <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=14'>15</a> # full covariance matrix ---> <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=15'>16</a> gmm_image_full = cluster_pixels_gmm(image, K, full_matrix=True) <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=16'>17</a> # diagonal covariance matrix <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=17'>18</a> gmm_image_diag = cluster_pixels_gmm(image, K, full_matrix=False) c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\HW2_Fall2023_Student.ipynb Cell 50 line 1 <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=15'>16</a> im_height, im_width, im_channel = image.shape <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=16'>17</a> flat_img = np.reshape(image, [-1, im_channel]).astype(np.float32) ---> <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=17'>18</a> gamma, (pi, mu, sigma) = GMM(flat_img, K=K, max_iters=10)(full_matrix) <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=18'>19</a> cluster_ids = np.argmax(gamma, axis=1) <a href='vscode-notebook-cell:/c%3A/Users/shres/Documents/Classes/CS%204641/HW2/hw2_code/HW2_Fall2023_Student.ipynb#Y311sZmlsZQ%3D%3D?line=19'>20</a> centers = mu File c:\Users\shres\Documents\Classes\CS 4641\HW2\hw2_code\gmm.py:301, in GMM.__call__(self, full_matrix, abs_tol, rel_tol, **kwargs) 298 gamma = self._E_step(pi, mu, sigma, full_matrix) 300 # M-step --> 301 pi, mu, sigma = self._M_step(gamma, full_matrix) 303 # calculate the negative log-likelihood of observation 304 joint_ll = self._ll_joint(pi, mu, sigma, full_matrix) TypeError: cannot unpack non-iterable NoneType object
In this question, you will be fitting your GMM implementation on a 2D Gaussian Mixture to estimate the parameters of the distributions that make up the mixture, and then using these estimated parameters to generate samples.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
data = np.load("./data/gaussian_clusters.npy")
print(data.shape)
plt.plot(data[:, 0], data[:, 1], "x")
plt.axis("equal")
plt.title("2-D Gaussian Mixture")
plt.show()
(1000, 2)
Now, you need to estimate the parameters of the Gaussian Mixture, and then use these estimated parameters to generate 1000 samples from the Gaussian Mixture. Plot the sampled datapoints. You should notice that it resembles the original Gaussian Mixture.
Steps
Generating vs Sampling To generate points directly from a given distribution is done via Inverse transform sampling. In inverse transform sampling, we require taking the inverse of cummulative distribution function for our gaussian mixture model. This operation in general can be expensive unless there is some known formula for inverting the CDF. It is also not always possible to take the inverse of the CDF of a gaussian mixture model. For these reasons we will implement a sampling method instead. This sampling method will give us points matching the gmm without the computation and mathematical concerns of generation.
Rejection Sampling
Conventionally we think of Gaussian Mixture Models as a form of soft clustering, but you can also think of them as an algorithm for estimating density of data points with gaussians. Thus we can take an arbitary data point and using the gaussian mixture model as an estimation for the density at a given location. From here we want the points that we sample to be proportional to the density at a given location.
We go about this by, choosing an arbitary point (x,y). Then we use the density formula function $f(x_i) = \sum_{k=1}^{K} \pi(k) \mathcal{N}(x_i | \mu_k, \Sigma_k)$ to find out what the density of points is at (x,y). Now that we have the density, we can draw a random number between 0 and the maximum density to determine if we will keep or discard (x,y). If the random number drawn is less than the density, then (x,y) is our sample, otherwise we discard (x,y) and repeat. This method ensure that the samples we generate are proportional to the density predicted by our GMM at any given area.
# TODO: fit your GMM implementation to the dataset, Undergrads remember to set full_matrix to False#
gmm = GMM(data,K=3,max_iters=1000)
gamma, (pi, mu, sigma) = gmm.__call__(full_matrix=False)
# TODO: print the estimated parameters
print(pi, "\n")
print(mu, "\n")
print(sigma)
iter 999, loss: 5208.5741: 100%|██████████| 1000/1000 [00:01<00:00, 751.55it/s]
[0.10432147 0.22832289 0.66735563] [[-7.57672041 -6.99147791] [-6.54198429 -6.69435717] [ 1.09252977 5.52956925]] [[[ 0.73744254 -0. ] [-0. 1.28015016]] [[ 0.71182511 0. ] [ 0. 0.88757271]] [[14.09194696 -0. ] [-0. 13.13797808]]]
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Extract x and y
x = data[:, 0]
y = data[:, 1]
# Define the borders of the grid
deltaX = (max(x) - min(x)) / 10
deltaY = (max(y) - min(y)) / 10
xmin = min(x) - deltaX
xmax = max(x) + deltaX
ymin = min(y) - deltaY
ymax = max(y) + deltaY
# Create meshgrid
xx, yy = np.mgrid[xmin:xmax:100j, ymin:ymax:100j]
# coordinates of the points that make the grid
positions = np.vstack([xx.ravel(), yy.ravel()]).T
def density(points, pi, mu, sigma, gmm):
"""Evaluate the density at each point on the grid.
Args:
points: (N, 2) numpy array containing the coordinates of the points that make up the grid.
pi: (K,) numpy array containing the mixture coefficients for each class
mu: (K, D) numpy array containing the means of each cluster
sigma: (K, D, D) numpy array containing the covariance matrixes of each cluster
gmm: an instance of the GMM model
Return:
densities: (N, ) numpy array containing densities at each point on the grid
HINT: You should be using the formula given in the hints.
"""
#print(pi)
#print(np.shape(points))
densities = []
for i in range(gmm.K):
pdf = gmm.normalPDF(points, mu[i], sigma[i])
density = pi[i]*pdf
#print(np.sum(density))
densities.append(density)
densities = np.asarray(densities)
densities = np.transpose(densities)
densities = np.sum(densities, axis=1)
#print(np.shape(densities))
return densities
# get the density at each coordinate on the grid
densities = np.reshape(density(positions, pi, mu, sigma, gmm), xx.shape)
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7), dpi=100)
ax = plt.axes(projection="3d")
surf = ax.plot_surface(
xx, yy, densities, rstride=1, cstride=1, cmap="coolwarm", edgecolor="none"
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("PDF")
ax.set_title("Surface plot of 2D Gaussian Mixture Densities")
fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
ax.view_init(60, 35)
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7), dpi=100)
ax = plt.axes(projection="3d")
w = ax.plot_wireframe(xx, yy, densities)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("PDF")
ax.set_title("Wireframe plot of 2D Gaussian Mixture")
Text(0.5, 0.92, 'Wireframe plot of 2D Gaussian Mixture')
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7), dpi=100)
ax = plt.axes(projection="3d")
surf = ax.plot_surface(
xx, yy, densities, rstride=1, cstride=1, cmap="coolwarm", edgecolor="none"
)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("PDF")
ax.set_title("Surface plot of 2D Gaussian Mixture Densities")
fig.colorbar(surf, shrink=0.5, aspect=5) # add color bar indicating the PDF
ax.view_init(60, 35)
###############################
### DO NOT CHANGE THIS CELL ###
###############################
fig = plt.figure(figsize=(13, 7), dpi=100)
ax = plt.axes(projection="3d")
w = ax.plot_wireframe(xx, yy, densities)
ax.set_xlabel("x")
ax.set_ylabel("y")
ax.set_zlabel("PDF")
ax.set_title("Wireframe plot of 2D Gaussian Mixture")
Text(0.5, 0.92, 'Wireframe plot of 2D Gaussian Mixture')
###############################
### DO NOT CHANGE THIS CELL ###
###############################
def rejection_sample(xmin, xmax, ymin, ymax, gmm, dmax=1, M=0.1):
"""Performs rejection sampling. Keep sampling datapoints until d <= f(x, y) / M
Args:
xmin: lower bound on x values
xmax: upper bound on x values
ymin: lower bound on y values
ymax: upper bound on y values
gmm: an instance of the GMM model
dmax: the upper bound on d
M: scale_factor. can be used to control the fraction of samples that are rejected
Return:
x, y: the coordinates of the sampled datapoint
HINT: Refer to the links in the hints
"""
while True:
x = np.random.uniform(low=xmin, high=xmax)
y = np.random.uniform(low=ymin, high=ymax)
d = np.random.uniform(low=0, high=dmax)
if d < density(np.array([x, y]).reshape(1, 2), pi, mu, sigma, gmm) / M:
return x, y
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Sample datapoints using Rejection Sampling
generated_datapoints = np.zeros((1000, 2))
i = 0
while i < 1000:
generated_datapoints[i, 0], generated_datapoints[i, 1] = rejection_sample(
xmin, xmax, ymin, ymax, gmm, dmax=1
)
if i % 100 == 0:
print(i)
i += 1
0 100 200 300 400 500 600 700 800 900
###############################
### DO NOT CHANGE THIS CELL ###
###############################
plt.scatter(generated_datapoints[:, 0], generated_datapoints[:, 1])
plt.axis("equal")
plt.title("Sampled Datapoints")
plt.show()
Learning to work with messy data is a hallmark of a well-rounded data scientist. In most real-world settings the data given will usually have some issue, so it is important to learn skills to work around such impasses. This part of the assignment looks to expose you to clever ways to fix data using concepts that you have already learned in the prior questions.
After graduating from Georgia Tech with your shiny new degree, you are recruited to help with safety testing for the Mars rocket at NASA. Of course NASA won't be sending rocket after rocket to stress test your fellow employees' engineering (they also graduated from Tech, so you have full confidence in them), so instead, NASA has decided to run numerous simulations on the current engineering design of the Mars rocket. The simulation collects shuttle data from its sensors, resulting in 10 features which include bypass, rad flow, etc. These features are contained within the first through tenth columns. The eleventh column shows the label with 1 being a successful simulation and 0 being an unsuccessful simulation.
However, due to an intern accidentally deleting random data points, 20% of the entries are missing labels and 30% are missing characterization data. Since simply removing the corrupted entries would not reflect the true variance of the data, your job is to implement a solution to clean the data so it can be properly classified. Thankfully there is a maximum of one missing datapoint for each entry, which means this datapoint can be approximated using the uncorrupted data.
Your job is to assist NASA in cleaning the data and implementing a semi-supervised learning framework to help them create a general classifier for future simulations.
You are given two files for this task:
The first step is to break up the whole dataset into clear parts. All the data is randomly shuffled in one csv file. In order to move forward, the data needs to be split into three separate arrays:
In semisupervised.py, implement the following methods:

###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_data_separating_methods()
The second step in this task is to clean the Labeled_incomplete dataset by filling in the missing values with probable ones derived from complete data. A useful approach to this type of problem is using a k-nearest neighbors (k-NN) algorithm. For this application, the method consists of replacing the missing value of a given point with the mean of the closest k-neighbors to that point. Given that you are focusing on neighbouring points, the margin of error from actual missing values should be limited.
In the CleanData class in semisupervised.py, implement the following methods:
The unit test is a good expectation of what the process should look like on a toy dataset. If your output matches the answer, you are on the right track. Run the following cell to check.
NOTE: Your rows of data should match with the expected output, although the order of the rows does not necessarily matter.
###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_cleandata()
Another method of filling the missing values is by using the mean of individual features. The mean of all non-NaN values of a feature could be used to replace any NaN value belonging to the feature. Implement the mean_clean_data method in accordance with this rule. NOTE: There should be no NaN values in the n*d array that you return from mean_clean_data.
In semisupervised.py, implement the following method:
###############################
### DO NOT CHANGE THIS CELL ###
###############################
localtests.SemisupervisedTests().test_mean_clean_data()
Take a look at the algorithm presented in Table 1 of the paper "Text Classification from Labeled and Unlabeled Documents using EM" by Nigam et al. (2000). While you are recommended to read the whole paper this assignment focuses on items 5.1, 5.2, and 6.1. Write a brief summary of three interesting highlights of the paper (50-words maximum).
Implement the EM algorithm proposed by Nigam et al. (2000) on Table 1, using a Gaussian Naive Bayes (GNB) classifier instead of a Naive Bayes (NB) classifier. (Hint: Using a GNB in place of an NB will enable you to reuse most of the implementation you developed for GMM in this assignment. In fact, you can successfully solve the problem by simply modifying the call and _init_components methods.)
In the SemiSupervised class in semisupervised.py, implement the following methods:
Compare the classification error based on the Gaussian Naive Bayes (GNB) classifier you implemented following the Nigam et al. (2000) approach to the performance of a GNB classifier trained using only labeled data. Since you have not covered supervised learning in class, you are allowed to use the scikit learn library for training the GNB classifier based only on labeled data: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html.
In the ComparePerformance class in semisupervised.py, implement the following method:
To acheive the full 5 points you must implement the ComparePerformance.accuracy_semi_supervised and ComparePerformance.accuracy_GNB methods and get these scores:
###############################
### DO NOT CHANGE THIS CELL ###
###############################
from semisupervised import complete_
from semisupervised import incomplete_
from semisupervised import unlabeled_
from semisupervised import mean_clean_data
from semisupervised import CleanData
from semisupervised import ComparePerformance
###############################
### DO NOT CHANGE THIS CELL ###
###############################
# Load training data
all_data = np.loadtxt("data/data.csv", delimiter=",")
# Separate training data into categories: labeled complete, labeled incomplete, and unlabeled points
labeled_complete = complete_(all_data)
labeled_incomplete = incomplete_(all_data)
unlabeled = unlabeled_(all_data)
# Perform data cleaning on labeled incomplete data
cleaned_data = CleanData()(labeled_incomplete, labeled_complete, 10)
# Combine cleaned data with unlabeled data
cleaned_and_unlabeled = np.concatenate((cleaned_data, unlabeled), 0)
# Category for data that is guaranteed to have label values
labeled_data = np.concatenate((labeled_complete, labeled_incomplete), 0)
# Perform mean data cleaning on all labeled data
mean_cleaned_data = mean_clean_data(labeled_data)
# Print data shapes
print(f"All Data shape: {all_data.shape}")
print(f"Labeled Complete shape: {labeled_complete.shape}")
print(f"Labeled Incomplete shape: {labeled_incomplete.shape}")
print(f"Labeled shape: {labeled_data.shape}")
print(f"Unlabeled shape: {unlabeled.shape}")
print(f"Cleaned data shape: {cleaned_data.shape}")
print(f"Cleaned + Unlabeled data shape: {cleaned_and_unlabeled.shape}")
# load validation data
validation = np.loadtxt("data/validation.csv", delimiter=",")
# =========================================================================
# SUPERVISED GNB WITH ONLY THE COMPLETE DATA (SKLEARN)
accuracy_complete_data_only = ComparePerformance.accuracy_GNB(
labeled_complete, validation
)
# =========================================================================
# SUPERVISED GNB WITH CLEAN DATA (SKLEARN)
accuracy_cleaned_data = ComparePerformance.accuracy_GNB(cleaned_data, validation)
# =========================================================================
# SUPERVISED GNB WITH MEAN CLEAN DATA (SKLEARN)
accuracy_mean_cleaned_data = ComparePerformance.accuracy_GNB(
mean_cleaned_data, validation
)
# =========================================================================
# SEMI SUPERVISED GNB WITH ALL DATA (your implementation)
accuracy_semi_supervised = ComparePerformance.accuracy_semi_supervised(
cleaned_and_unlabeled, validation, 2
)
# ==========================================================================
# COMPARISON
print("""===COMPARISON===""")
print(
f"Supervised with only complete data, GNB Accuracy: {np.round(100.0 * accuracy_complete_data_only, 3)}%"
)
print(
f"Supervised with KNN clean data, GNB Accuracy: {np.round(100.0 * accuracy_cleaned_data, 3)}%"
)
print(
f"Supervised with Mean clean data, GNB Accuracy: {np.round(100.0 * accuracy_mean_cleaned_data, 3)}%"
)
print(
f"SemiSupervised Accuracy: {np.round(100.0 * accuracy_semi_supervised, 3)}%"
)
What are the differences in using the kNN method and the mean method to fill NaN values? Explain in terms of the results you get from each.
Answers: Multiple answers acceptable - Anything to the effect of explaining how the dataset changes or why the accuracy from the model should respond to the different methods. Eg. the quality of data might be better with the kNN method as it looks for mean from only relevant subsection of values instead of all the values like the mean method.